TITLE OF THE INVENTION 

SPEECH RECOGNITION SYSTEM AND METHOD
BACKGROUND OF THE INVENTION 
FIELD OF THE INVENTION 

The present invention relates to a speech recognition
system that allows electronic equipment to be controlled or
manipulated by uttered speech, and to a speech recognition
method for use in such a speech recognition system.

DESCRIPTION OF THE RELATED ART

Known speech recognition systems of this type are
applied to electronic equipment such as on-board audio
systems and on-board navigation systems.

In an on-board audio system equipped with a speech
recognition system, when a passenger says the name of a
desired radio broadcasting station, for example, the speech
recognition system recognizes the uttered speech and
automatically tunes to the reception frequency of that radio
broadcasting station based on the recognition result. This
improves the operability of the on-board audio system and
makes it easier for a passenger to use the on-board audio
system.

This speech recognition system also has other
capabilities that relieve a passenger of the burden of
operating an MD (Mini Disc) player and/or CD (Compact Disc)
player. When the passenger loads an information-carrying
recording/reproducing medium, such as an MD, into the
MD player and says the title of a musical piece recorded on
that recording/reproducing medium, for example, the speech
recognition system recognizes the uttered speech and
automatically plays the selected musical piece.

An on-board navigation system equipped with a speech
recognition system is capable of recognizing speech uttered
by a driver or the like to specify the name of a
destination, and of displaying a map showing the route from
the present location to the destination. This capability
allows the driver to concentrate on driving the vehicle,
thus ensuring a safer driving environment.

The above-described conventional speech recognition 
systems are designed to cope with a single person who utters 
words of instructions. The conventional speech recognition 
systems therefore have only a single microphone for 
inputting speeches provided at a location nearest to a 
driver who is very likely to use the microphone. 

Other passengers who are seated far from the microphone
must therefore speak loudly toward the microphone to
secure a sufficient input voice level. To improve the
speech recognition precision of such a speech recognition
system, passengers other than the driver must also speak
loudly toward the microphone so that their uttered speech is
input into the microphone without being affected by noise in
the vehicle.
Accordingly, it is an object of the present invention
to provide a speech recognition system which has improved
operability and allows more than one person to secure a
sufficient input voice level without speaking loudly and
without being affected by ambient noise.

It is another object of this invention to provide a 
speech recognition method for use in a speech recognition 
system, which improves the operability of the speech 
recognition system. 
To achieve the first object, according to one aspect of
this invention, there is provided a speech recognition
system which comprises a plurality of voice pickup sections
for picking up uttered voices; a determination section for
determining a speech signal suitable for speech recognition
from speech signals output from the plurality of voice
pickup sections; and a speech recognizer for performing
speech recognition based on the speech signal determined by
the determination section.

According to another aspect of this invention, there is 
provided a speech recognition method for a speech 
recognition system having a plurality of voice pickup means 
for picking up voices, which comprises a determination step 
of determining a speech signal suitable for speech 
recognition from speech signals output from the plurality of 
voice pickup means; and a speech recognition step of
performing speech recognition based on the speech signal
determined by the determination step.
In the speech recognition system or speech recognition
method, among the speech signals output from the plurality
of voice pickup sections (voice pickup means), a speech
signal whose speech level remains equal to or higher than a
predetermined speech level over a predetermined period of
time may be determined as the speech signal suitable for
speech recognition.

It is preferable that the determination section (or
step) acquires an average S/N value and an average voice
power for each of the speech signals output from the
plurality of voice pickup sections (or voice pickup means)
and determines a speech signal whose average S/N value and
average voice power are greater than respective
predetermined threshold values as the speech signal suitable
for speech recognition.

In this case, it is preferable that the determination
section determines a candidate order of those speech signals
whose average S/N values and average voice powers are
greater than the respective predetermined threshold values
and which are candidates for the speech signal suitable for
speech recognition, in accordance with the average S/N
values and average voice powers; and the speech recognizer
sequentially executes speech recognition on the candidates
in accordance with the candidate order from the highest
candidate to lower ones.

In any of the speech recognition system and method and 
their preferable modes, the determination section (or step) 
treats those of the speech signals which are other than the
speech signal suitable for speech recognition as noise 
signals. 

In any of the speech recognition system and method and 
their preferable modes, of other speech signals than the 
speech signal suitable for speech recognition, that speech 
signal whose average S/N value and average voice power 
become minimum may be treated as a noise signal by the 
determination section. 

With the above structures, when a speaker makes a 
desired speech, a speech signal and a noise signal suitable 
for speech recognition are automatically determined from the 
individual speech signals output from a plurality of voice 
pickup sections (or voice pickup means) and speech 
recognition is carried out based on the determined speech 
signal and noise signal. Accordingly, the speaker has only 
to utter words or voices without consciously making such a 
speech to a specific voice pickup section. This leads to an 
improved operability of the speech recognition system. 
BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram illustrating the structure of 
a speech recognition system according to one embodiment of 
the present invention; 

FIG. 2A is a plan view exemplifying the layout of
microphones in an ordinary 4-seat vehicle;

FIG. 2B is a plan view showing another layout of
microphones in an ordinary 4-seat vehicle;

FIG. 3A is a plan view exemplifying the layout of
microphones in a wagon or the like;

FIG. 3B is a plan view showing another layout of 
microphones in a wagon or the like; 

FIG. 4 is a block diagram showing the structures of a 
multiplexer, a demultiplexer and a storage section; 

FIG. 5 is a timing chart for explaining the timings of 
sampling an input signal and storing sampled signals into a 
storage section; 

FIGS. 6A through 6D are explanatory diagrams for 
explaining how to compute an average voice power, an average 
noise power and an average S/N value; 

FIG. 7 is an explanatory diagram showing the structure 
of a speech condition table; 

FIG. 8 is an explanatory diagram showing the structure
of a noise selection table;

FIG. 9 is a flowchart for explaining the operation of
the speech recognition system according to this embodiment;

FIG. 10 is a flowchart for further explaining the
operation of the speech recognition system according to this
embodiment; and

FIG. 11 is a block diagram illustrating the structure 
of a modification of the speech recognition system according 
to this embodiment. 

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT 
With reference to the accompanying drawings, a
description will now be given of a preferred embodiment of
the present invention as adapted to a speech recognition
system which enables voice- or speech-based control or
manipulation of electronic equipment installed in a
vehicle, such as an on-board audio system or an on-board
navigation system.

FIG. 1 is a block diagram illustrating the structure of
a speech recognition system according to this embodiment of
this invention. Referring to this diagram, the speech
recognition system comprises a plurality of microphones M1 to
MN as voice pickup means, a plurality of pre-circuits CC1 to
CCN, a multiplexer 1, an A/D (Analog-to-Digital) converter
(ADC) 2, a demultiplexer 3, a storage section 4, a speech
detector 5, a data analyzer 6, a speech recognizer 7, a
controller 8 and a speech switch 9.

The pre-circuits CC1-CCN, the multiplexer 1, the A/D
converter 2, the demultiplexer 3, the storage section 4, the 
speech detector 5, the data analyzer 6 and the controller 8 
constitute determination means which determines a speech 
signal and noise signal suitable for speech recognition. 

The single speech switch 9 is provided in the vicinity 
of a driver seat, for example, on a front dash board or one 
end of a front door by the driver seat. 

The controller 8 has a microprocessor (MPU), which
controls the general operation of this speech recognition
system. When the speech switch 9 is switched on, sending an
ON signal SW to the microprocessor, the microprocessor
causes the microphones M1-MN to initiate a voice pickup
operation.

The speech detector 5 has number-of-speeches counters
FC1-FCN that are used to determine to which microphone an
uttered speech is directed; their details will be
given in a later description of the operation of the speech
recognition system.

The individual microphones M1-MN are provided at
locations where it is easy to pick up speeches uttered by
individual passengers, e.g., in the vicinity of the
individual passenger seats including the driver seat.

In one example where four microphones M1-M4 are placed
in a 4-seat vehicle, the microphones M1 and M2 are placed in
front of the driver seat and the front passenger seat, and
the microphones M3 and M4 are placed in front of the rear
passenger seats, e.g., on the corresponding roof portions or
at the back of the driver seat and the front passenger seat,
as shown in the plan view of FIG. 2A. This way, the
individual microphones M1-M4 are associated with the
respective passengers.

In another example, as shown in the plan view of FIG. 2B,
the microphones M1 and M2 may be placed in the front door by
the driver seat and the front door by the front passenger
seat, and the microphones M3 and M4 may be placed in the rear
doors by the respective rear passenger seats, so that the
individual microphones M1-M4 are associated with the
respective passengers.

In a further example, the microphones M1-M4 may be
provided at combined locations shown in FIGS. 2A and 2B.
Specifically, the microphone M1 is placed in front of the
driver seat as shown in FIG. 2A or in the front door by the
driver seat as shown in FIG. 2B, so that a single microphone
is provided for the driver who sits on the driver seat.
Likewise, either the location shown in FIG. 2A or the
location shown in FIG. 2B is selected for each of the
remaining microphones M2-M4.

In the case of a wagon type vehicle or the like which
holds a greater number of seats, for example, a greater
number of microphones M1-M6 are provided in accordance with
the seats and at the locations where it is easy to pick up
speeches uttered by individual passengers, as shown in the
plan views of FIGS. 3A and 3B. Note that the microphones M1-M6
may be provided at combined locations shown in FIGS. 3A and
3B as in the aforementioned case of the 4-seat vehicle.

It is to be noted that the aforementioned microphone
layouts have been given simply as examples, and are to be
considered as illustrative and not restrictive. In practice,
system information that is used in the speech recognition
system of this invention is constructed beforehand in
consideration of the characteristics of voice transmission
from individual passengers to the respective microphones.
Strictly speaking, therefore, the conditions for setting the
microphones are not restricted at all. Further, the number
of microphones can be determined to be equal to or smaller
than the maximum number of passengers predetermined in
accordance with the type of vehicle.

The layout of the individual microphones is not limited
to a simple layout that makes the distances between the
microphones and the respective passengers equal to one
another. Those distances and the locations of the
individual microphones may be determined based on the
results of analysis of the voice characteristics in a
vehicle previously acquired through experiments or the like
in such a way that the characteristics of voice transmission
from the respective passengers to the microphones become
substantially the same.

Returning to FIG. 1, the microphones M1-MN are connected
to the respective pre-circuits CC1-CCN, thus constituting N
channels of signal processing systems.

Each of the pre-circuits CC1-CCN has an amplifier (not
shown) which amplifies the amplitude level of the associated
one of input speech signals S1 to SN, supplied from the
microphones M1-MN, to a level that is suitable for signal
processing, and a band-pass filter (not shown) which passes
only a predetermined frequency component of the amplified
input speech signal. The pre-circuits CC1-CCN supply input
speech signals S1' to SN', which have passed the respective
band-pass filters, to the multiplexer 1.

Each band-pass filter is set with a low cut-off
frequency fL (e.g., fL = 100 Hz) for eliminating
low-frequency noise included in the associated one of the
input speech signals S1-SN and a high cut-off frequency fH
chosen in consideration of the Nyquist frequency. The low
cut-off frequency fL and high cut-off frequency fH are set
so that the frequency range of voices that human beings
utter is included in the range between those two
frequencies.

As shown in FIG. 4, the multiplexer 1 comprises analog
switches AS1 to ASN for N channels. The input speech signals
S1'-SN' from the pre-circuits CC1-CCN are supplied to the
input terminals of the respective analog switches AS1-ASN,
whose output terminals are connected together to the A/D
converter 2. In accordance with channel switch signals CH1
to CHN supplied from the controller 8, the analog switches
AS1-ASN exclusively switch the input speech signals S1'-SN'
and supply the switched input speech signals to the
A/D converter 2.

The A/D converter 2 converts the input speech signals
S1'-SN', sequentially supplied from the multiplexer 1, to
digital input data D1 to DN in synchronism with a
predetermined sampling frequency f, and supplies the digital
input data D1-DN to the demultiplexer 3.

The sampling frequency f is set by a sampling clock
CKADC from the controller 8 and is determined in
consideration of anti-aliasing. More specifically, the
sampling frequency f is determined to be equal to or higher
than approximately twice the high cut-off frequency fH of the
band-pass filter, and is set, for example, in a range of 8
kHz to 11 kHz.
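
The relationship between the sampling frequency f and the high cut-off frequency fH can be checked with a short calculation. The following is only a minimal sketch; the function names and the 4 kHz value used for fH are assumptions made for illustration and do not appear in this specification.

```python
def min_sampling_frequency(f_high_hz):
    """Lowest sampling frequency (Hz) that is twice the high cut-off fH."""
    return 2.0 * f_high_hz


def is_alias_free(f_sample_hz, f_high_hz):
    """True when the sampling frequency is at least twice fH."""
    return f_sample_hz >= min_sampling_frequency(f_high_hz)


# Assumed high cut-off frequency; the specification gives no value for fH.
F_HIGH_HZ = 4000.0

for f_sample in (8000.0, 11000.0):
    print(f_sample, is_alias_free(f_sample, F_HIGH_HZ))
```

Under this assumed fH, both ends of the 8 kHz to 11 kHz range satisfy the condition.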

The demultiplexer 3 comprises analog switches AW1 to AWN
for N channels, as shown in FIG. 4. The analog switches
AW1-AWN have their input terminals connected together to the
output terminal of the A/D converter 2 and their output
terminals respectively connected to memory areas ME1 to MEN
for N channels provided in the storage section 4. In
accordance with the channel switch signals CH1-CHN supplied
from the controller 8, the analog switches AW1-AWN
exclusively switch the input data D1-DN and supply the
switched input data D1-DN to the respective memory areas
ME1-MEN.

Referring now to the timing chart in FIG. 5, the
operations of the multiplexer 1, the A/D converter 2 and the
demultiplexer 3 will be explained. When the speech switch 9
is set on, the resultant ON signal SW is received by the
controller 8, which in turn outputs the sampling clock CKADC
and the channel switch signals CH1-CHN.

The sampling clock CKADC has a pulse waveform which
repeats the logical inversion N times during a period
(sampling period) T which is the reciprocal, 1/f, of the
sampling frequency f. The channel switch signals CH1-CHN
have pulse waveforms which sequentially become logic "1"
every period T/N of the sampling clock CKADC.

The multiplexer 1 exclusively performs switching
between enabling and disabling of the input speech signals
S1'-SN' in synchronism with the period T/N in which the
channel switch signals CH1-CHN sequentially become logic "1".
As a result, the input speech signals S1'-SN' are
sequentially supplied to the A/D converter 2 in synchronism
with the period T/N to be converted to the digital data D1-DN.
The demultiplexer 3 likewise exclusively performs switching
between enabling and disabling of the input data D1-DN in
synchronism with the period T/N in which the channel switch
signals CH1-CHN sequentially become logic "1". Accordingly,
the input data D1-DN from the A/D converter 2 are distributed
and stored in the respective memory areas ME1-MEN in
synchronism with the period T/N.
As sampling N channels of input speech signals S1'-SN'
in the sampling period T (= 1/f) is repeated this way, it is
possible to generate N channels of input data D1-DN with even
the single A/D converter 2 in synchronism with the sampling
frequency f and to store the input data D1-DN into the
predetermined memory areas ME1-MEN, respectively.
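
The time-division operation described above can be illustrated in software. The following is a minimal sketch under stated assumptions: the function name, the modelling of each channel as a callable of time, and the test signals are hypothetical, and the real system performs this switching in hardware.

```python
def time_division_sample(channels, num_periods):
    """Simulate one A/D converter shared by N channels.

    channels: list of N callables, each mapping a time (in units of the
    sampling period T) to a sample value. Within each period T, slot k
    (length T/N) selects channel k, mirroring the channel switch signals.
    Returns N lists standing in for the memory areas ME1..MEN.
    """
    n = len(channels)
    memory_areas = [[] for _ in range(n)]
    for period in range(num_periods):
        for slot in range(n):
            t = period + slot / n              # conversion instant of this slot
            sample = channels[slot](t)         # multiplexer passes one channel
            memory_areas[slot].append(sample)  # demultiplexer routes the result
    return memory_areas


# Hypothetical test signals: channel k yields (k + 1) * 100 + period index.
signals = [lambda t, k=k: (k + 1) * 100 + int(t) for k in range(3)]
areas = time_division_sample(signals, num_periods=4)
print(areas)
```

Each memory area receives exactly one sample per sampling period T, so every channel is effectively sampled at the frequency f even though a single converter is used.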

The storage section 4, which is constituted by a
semiconductor memory, has the aforementioned memory areas
ME1-MEN for N channels. That is, the memory areas ME1-MEN are
provided in association with the microphones M1-MN.
As shown in FIG. 4, each of the memory areas ME1-MEN has
a plurality of frame areas MF1, MF2 and so forth for storing
the associated one of the input data D1-DN frame by frame,
each frame consisting of a predetermined number of samples.

Referring to the memory area ME1, for example, the
frame areas MF1, MF2 and so forth sequentially store the
input data D1 supplied from the demultiplexer 3 by a
predetermined number of samples (256 samples in this
embodiment) in accordance with an address signal ADR from
the controller 8. That is, every 256 samples of the input
data D1 are stored in each frame area MF1, MF2 or the like in
each frame period TF, which is 256 x T as shown in FIG. 5.
Input data for one frame period (1TF), which is stored in
each frame area MF1, MF2 or the like, is called "frame data".

Likewise, the input data D2-DN are stored, 256 samples
each, in the frame areas MF1, MF2 and so forth in the
remaining memory areas ME2-MEN in each frame period TF.
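
The framing of one channel's input data can be sketched as follows. This is only an illustration: the function names are hypothetical, and the 8 kHz figure used for the frame period is one end of the range given for the sampling frequency, chosen here as an assumption.

```python
FRAME_SIZE = 256  # samples per frame, as stated for this embodiment


def split_into_frames(samples, frame_size=FRAME_SIZE):
    """Group samples into complete frames of frame_size samples.

    A trailing partial frame is dropped, on the assumption that a frame
    area only counts as stored once all of its samples have arrived.
    """
    full = len(samples) // frame_size
    return [samples[i * frame_size:(i + 1) * frame_size] for i in range(full)]


def frame_period_seconds(sampling_hz, frame_size=FRAME_SIZE):
    """Frame period TF = frame_size x T, where T = 1 / sampling_hz."""
    return frame_size / sampling_hz


frames = split_into_frames(list(range(600)))
print(len(frames), frames[1][0])
print(frame_period_seconds(8000.0))
```

With an assumed sampling frequency of 8 kHz, one frame period TF corresponds to 32 ms.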

The speech detector 5 and the data analyzer 6 are 
constituted by a DSP (Digital Signal Processor). 

Every time frame data is stored in the frame areas MF1,
MF2 and so forth in each of the memory areas ME1-MEN, the
speech detector 5 computes the LPC (Linear Predictive
Coding) residual of the latest frame data and determines if
the computed value is equal to or greater than a
predetermined threshold value THD1. When the computed value
is equal to or greater than the predetermined threshold
value THD1, the speech detector 5 determines that the latest
frame data is speech frame data produced from a speech.
When the computed value is smaller than the predetermined
threshold value THD1, the speech detector 5 determines that
the latest frame data is input data that has not been
produced from a speech, i.e., noise frame data that has been
produced by noise in a vehicle.

When the computed LPC residual value remains equal to
or greater than the predetermined threshold value THD1 over
three frame periods (3TF), the speech detector 5 settles
that the frame data over the three frame periods (3TF) is
definitely speech frame data produced from a speech and
transfers speech detection data DCT1 indicative of the
result of the decision to the controller 8.

More specifically, the LPC residuals of frame data
stored in the individual frame areas MF1, MF2 and so forth in
each of the memory areas ME1-MEN are individually computed
channel by channel, and each channel-by-channel computed LPC
residual value is compared with the threshold value THD1 to
determine, channel by channel, if the frame data is speech
frame data produced from a speech.

Given that ε1 is the computed LPC residual value of the
first channel associated with the microphone M1, ε2 is the
computed LPC residual value of the second channel associated
with the microphone M2, and likewise ε3 to εN are the
computed LPC residual values of the third to N-th channels
respectively associated with the microphones M3-MN, the
computed values ε1-εN are compared with the threshold value
THD1. The frame data that corresponds to the channel whose
computed LPC residual value is equal to or greater than
the threshold value THD1 is determined as speech frame data
that has been generated from a speech. Further, the speech
frame data that corresponds to the channel whose computed
LPC residual value remains equal to or greater than the
threshold value THD1 over three frame periods (3TF) is
settled as speech frame data that is definitely generated
from a speech.
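
The two-stage decision described above (a per-frame comparison with THD1, then settlement over three frame periods) can be sketched as follows. The sketch assumes that the LPC residual value of each frame has already been computed; the function names, the numeric residual values and the value chosen for THD1 are hypothetical.

```python
CONFIRM_FRAMES = 3  # three frame periods (3TF), as stated in the text


def classify_frames(residuals, thd1):
    """Per-frame decision: True for speech frame data, False for noise."""
    return [r >= thd1 for r in residuals]


def first_settled_frame(residuals, thd1, confirm=CONFIRM_FRAMES):
    """Index of the first frame of the earliest run of `confirm`
    consecutive frames at or above thd1, or None if no run exists."""
    run = 0
    for i, r in enumerate(residuals):
        run = run + 1 if r >= thd1 else 0
        if run == confirm:
            return i - confirm + 1
    return None


# Hypothetical residual values for two channels over six frame periods.
residuals_by_channel = [
    [0.1, 0.2, 0.9, 0.8, 0.7, 0.9],  # sustained speech from the third frame
    [0.1, 0.6, 0.1, 0.6, 0.1, 0.1],  # isolated exceedances only
]
THD1 = 0.5  # assumed threshold value
settled = [first_settled_frame(r, THD1) for r in residuals_by_channel]
print(settled)
```

In this example only the first channel is settled as speech; the second channel exceeds the threshold in isolated frames but never over three consecutive frame periods.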

When a speech has been directed to the microphone M1
and the uttered voices have not been input to the remaining
microphones M2-MN, for example, only the frame data that is
stored in the memory area ME1 of the channel associated with
the microphone M1 is determined and settled as speech frame
data that has been produced from the speech, and the frame
data stored in the memory areas ME2-MEN associated with the
remaining microphones M2-MN are determined as noise frame
data generated from noise in the vehicle.

When a speech has been directed to the microphone M1
and the uttered voices have reached the microphone M2 but not
the remaining microphones M3-MN, for example, the frame data
stored in the memory areas ME1 and ME2 of the channels
associated with the microphones M1 and M2 are both determined
and settled as speech frame data produced from the speech,
and the frame data stored in the memory areas ME3-MEN
associated with the remaining microphones M3-MN are
determined as noise frame data.

In the above-described manner, the speech detector 5
computes the LPC residual of each of the frame data stored
in the memory areas ME1-MEN, compares it with the threshold
value THD1 to determine whether uttered voices have been
input to any microphone and in which frame period they have
been input, and transfers the speech detection data DCT1
carrying information on those decisions to the controller 8.
The speech detection data DCT1 is transferred to the
controller 8 as predetermined code data which indicates the
memory area where speech frame data has been stored over the
aforementioned three frames or more (hereinafter this memory
area will be called "speech memory channel") and its frame
area (hereinafter called "speech memory frame").

Specifically, the speech detection data DCT1 has an
ordinary data structure of, for example, DCT1{CH1(TF1, TF2 ...
TFm), CH2(TF1, TF2 ... TFm), ..., CHN(TF1, TF2 ... TFm)}.
CH1-CHN are flag data representing the individual channels,
and TF1, TF2 ... TFm are flag data corresponding to the
individual frame areas MF1, MF2 ... MFm.

When an uttered speech is input only to the microphone
M1 and speech frame data is stored in the third and
subsequent frame areas MF3, MF4 and so forth, speech
detection data DCT1 of binary codes of DCT1{1(0,0,1,1...1),
0(0,0,0...0), ..., 0(0,0,0...0)} is transferred to the
controller 8.
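
One possible software representation of the speech detection data DCT1 is sketched below. The tuple layout, the function names and the rule of clearing the frame flags of channels that are not speech memory channels are assumptions made for illustration; the specification defines only the flag structure itself.

```python
def has_run(flags, length):
    """True if `flags` contains at least `length` consecutive True values."""
    run = 0
    for flag in flags:
        run = run + 1 if flag else 0
        if run >= length:
            return True
    return False


def build_dct1(speech_flags_by_channel, confirm=3):
    """Return one (channel_flag, frame_flags) pair per channel.

    The channel flag is 1 when the channel holds speech frame data over
    at least `confirm` consecutive frames (a "speech memory channel").
    For other channels all frame flags are cleared (an assumption).
    """
    dct1 = []
    for flags in speech_flags_by_channel:
        if has_run(flags, confirm):
            dct1.append((1, tuple(int(f) for f in flags)))
        else:
            dct1.append((0, tuple(0 for _ in flags)))
    return dct1


# Speech input only to the first microphone, from the third frame onward.
flags = [
    [False, False, True, True, True],
    [False, False, False, False, False],
    [False, False, False, False, False],
]
dct1 = build_dct1(flags)
print(dct1)
```

For this input the first entry corresponds to 1(0,0,1,1,1) and the remaining entries to 0(0,0,0,0,0), matching the binary-code example given above for five frame areas.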

When the speech detection data DCT1 is transferred to 
the controller 8, the controller 8 generates control data 
CNT1 indicating the speech memory channel and speech memory 
frame based on the speech detection data DCT1, and sends the 
control data CNT1 to the data analyzer 6.

The data analyzer 6 comprises an optimal-speech
determining section 6a, a noise determining section 6b, an
average-S/N computing section 6c, an average-voice-power
computing section 6d, an average-noise-power computing
section 6e, a speech condition table 6f and a noise
selection table 6g. When receiving the control data CNT1
from the controller 8, the data analyzer 6 initiates a
process of determining speech frame data and noise frame
data suitable for speech recognition.

The average-voice-power computing section 6d acquires
information on the speech memory channel and speech memory
frame from the control data CNT1, reads speech frame data
from the memory area that corresponds to that speech memory
channel and speech memory frame, and computes average voice
power P(n) of the speech frame data channel by channel. The
variable n in the average voice power P(n) indicates a
channel number.

When speech frame data is stored in the memory areas
ME1-ME4 corresponding to the channels CH1-CH4 as shown in FIGS.
6A to 6D, for example, the average voice powers P(1) to P(4)
of plural pieces of speech frame data corresponding to a
plurality of predetermined frame periods (m2 x TF) from a
time ts at which a speech has started are computed channel by
channel. The average voice power P(n) is computed by
obtaining the sum of squares of speech frame data in the
frame periods (m2 x TF) and then dividing the sum by the
number of the frame periods (m2 x TF).

The average-noise-power computing section 6e acquires
information on the speech memory channel and speech memory
frame from the control data CNT1, reads noise frame data
preceding the speech frame data by a plurality of frame
periods (m1 x TF) from the memory area that corresponds to
that speech memory channel and speech memory frame, and
computes average noise power NP(n) of the noise frame data
channel by channel. The variable n in the average noise
power NP(n) indicates a speech channel, and the average
noise power NP(n) is computed by obtaining the sum of
squares of noise frame data in the frame periods (m1 x TF)
and then dividing the sum by the number of the frame periods
(m1 x TF).

When speech frame data is stored in the memory areas
ME1-ME4 corresponding to the channels CH1-CH4 as shown in FIGS.
6A to 6D, for example, the average noise power NP(n) of
plural pieces of noise frame data in the plurality of frame
periods (m1 x TF) preceding the time ts at which a speech
has started (at which storage of the speech frame data has
started) is computed.
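
The computation of the average voice power P(n) and the average noise power NP(n) for one channel can be sketched as follows. The function names and sample values are hypothetical. The sketch divides the sum of squares by the number of frames, which is one reading of "the number of the frame periods"; dividing by the number of samples would be an equally plausible interpretation.

```python
def average_power(frames):
    """Sum of squared samples over `frames` divided by the number of frames."""
    if not frames:
        raise ValueError("at least one frame is required")
    total = sum(sample * sample for frame in frames for sample in frame)
    return total / len(frames)


def voice_and_noise_power(channel_frames, start, m1, m2):
    """Return (P, NP) for one channel.

    start: index of the frame at which speech begins (time ts).
    m1: number of noise frames taken immediately before `start`.
    m2: number of speech frames taken from `start` onward.
    """
    noise_frames = channel_frames[max(0, start - m1):start]
    voice_frames = channel_frames[start:start + m2]
    return average_power(voice_frames), average_power(noise_frames)


# Tiny hypothetical frames of two samples each; speech starts at frame 2.
frames = [[1, -1], [1, 1], [4, -4], [3, 3]]
p, noise_p = voice_and_noise_power(frames, start=2, m1=2, m2=2)
print(p, noise_p)
```

Here the two frames preceding ts give the noise power and the two frames starting at ts give the voice power.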

The average-S/N computing section 6c computes an
average S/N value SN(n) which represents the value of the
signal-to-noise ratio for each speech channel based on the
average voice power P(n) computed by the average-voice-power
computing section 6d and the average noise power NP(n)
computed by the average-noise-power computing section 6e.

In the case where the channels CH1-CH4 are speech
channels as shown in FIGS. 6A to 6D, for example, the
average S/N values SN(1) to SN(4) of the individual channels
CH1-CH4 are computed from the following equations 1 to 4.

SN(1) = P(1)/NP(1) ... (1)

SN(2) = P(2)/NP(2) ... (2)

SN(3) = P(3)/NP(3) ... (3)

SN(4) = P(4)/NP(4) ... (4)

Logarithmic values of the average S/N values SN(1) to
SN(4) computed from the equations 1 to 4 may be taken as the
average S/N values SN(1)-SN(4) of the individual channels
CH1-CH4.
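
Equations 1 to 4 and the optional logarithmic form can be sketched as follows. The function name and the power values are hypothetical, and the use of 10 x log10 (a decibel scale) is an assumption; the text states only that logarithmic values may be taken.

```python
import math


def average_snr(voice_power, noise_power, use_log=False):
    """SN(n) = P(n) / NP(n), optionally expressed as 10 * log10 of the ratio."""
    if noise_power <= 0:
        raise ValueError("noise power must be positive")
    ratio = voice_power / noise_power
    if use_log:
        if ratio <= 0:
            raise ValueError("voice power must be positive for a logarithm")
        return 10.0 * math.log10(ratio)
    return ratio


# Hypothetical average powers for channels CH1-CH4.
P = [25.0, 8.0, 2.0, 1.0]
NP = [2.0, 2.0, 2.0, 2.0]
sn = [average_snr(p, n) for p, n in zip(P, NP)]
print(sn)
```

Taking logarithms does not change the ordering of the channels, so either form can be compared with a threshold as long as the threshold is expressed on the same scale.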

The optimal-speech determining section 6a compares the
average S/N value SN(n) acquired by the average-S/N
computing section 6c with a predetermined threshold value
THD2, and compares the average voice power P(n) acquired by
the average-voice-power computing section 6d with a
predetermined threshold value THD3. The optimal-speech
determining section 6a then collates the results of the
comparison with the speech condition table 6f shown in FIG.
7 to determine which channel of speech frame data is
suitable for the speech recognition process.

As shown in FIG. 7, the speech condition table 6f
stores reference data for ranking speech frame data in
accordance with the relationship between the average S/N
value and the threshold value THD2 and the relationship
between the average voice power and the threshold value THD3.
Referring to the speech condition table 6f based on the
comparison results, the optimal-speech determining section
6a ranks the speech frame data suitable for speech
recognition and determines the speech frame data of the
highest rank as the one suitable for speech recognition.
Specifically, the optimal-speech determining section 6a
determines the speech frame data whose average S/N value is
equal to or greater than the threshold value THD2 and whose
average voice power is equal to or greater than the
threshold value THD3 as a rank 1 (Rnk1), determines the
speech frame data whose average S/N value is equal to or
greater than the threshold value THD2 and whose average
voice power is less than the threshold value THD3 as a rank
2 (Rnk2), determines the speech frame data whose average S/N
value is smaller than the threshold value THD2 and whose
average voice power is equal to or greater than the
threshold value THD3 as a rank 3 (Rnk3), and determines the
speech frame data whose average S/N value is smaller than
the threshold value THD2 and whose average voice power is
less than the threshold value THD3 as a rank 4 (Rnk4).

Further, the optimal-speech determining section 6a
determines, among all the channels of speech frame data, the
speech frame data whose average S/N value and average voice
power are maximum as a rank 0 (Rnk0).

Then, the optimal-speech determining section 6a
determines the speech frame data that becomes the rank 0
(Rnk0) as a candidate most suitable for speech recognition
(first candidate). Further, the optimal-speech determining
section 6a determines the speech frame data that becomes the
rank 1 (Rnk1) as the next candidate suitable for speech
recognition (second candidate). When there are a plurality
of channels whose speech frame data become the rank 1 (Rnk1),
those speech frame data which have greater average S/N
values and greater average voice powers are determined as
candidates of higher ranks.

Further, the optimal-speech determining section 6a
removes the speech frame data that correspond to the rank 2
(Rnk2) to the rank 4 (Rnk4) from the targets for speech
recognition, considering that they are unsuitable for speech
recognition.

In short, the optimal-speech determining section 6a
compares the average S/N value SN(n) and the average voice
power P(n) with the threshold values THD2 and THD3,
respectively, collates the comparison results with the
speech condition table 6f shown in FIG. 7 to determine the
speech frame data that is suitable for speech recognition,
and then assigns a priority order, or ranking, to the speech
frame data suitable for speech recognition. Then, the
optimal-speech determining section 6a transfers speech
candidate data DCT2 indicating the ranking to the controller 8.
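
The ranking rule described above can be sketched as follows. This is only an illustrative sketch, not the implementation of the embodiment. The function names `rank_channel`, `rank_channels` and `candidate_order` are hypothetical, the fourth threshold combination is treated as rank 4 (consistent with the removal of ranks 2 to 4), rank 0 is assumed to be given only when a single channel holds both maxima, and ties within rank 1 are assumed to be broken by S/N value first and voice power second.

```python
def rank_channel(sn, power, thd2, thd3):
    """Rank one channel from its average S/N value and average voice power."""
    if sn >= thd2:
        return 1 if power >= thd3 else 2
    return 3 if power >= thd3 else 4


def rank_channels(sn_values, powers, thd2, thd3):
    """Return a list of ranks (0 to 4), one per channel."""
    ranks = [rank_channel(sn, p, thd2, thd3)
             for sn, p in zip(sn_values, powers)]
    best_sn = max(range(len(sn_values)), key=lambda n: sn_values[n])
    best_p = max(range(len(powers)), key=lambda n: powers[n])
    # Assumption: rank 0 is given only when one channel holds both maxima.
    if best_sn == best_p:
        ranks[best_sn] = 0
    return ranks


def candidate_order(ranks, sn_values, powers):
    """Channels of rank 0 or 1, best first; ranks 2 to 4 are dropped."""
    keep = [n for n, r in enumerate(ranks) if r <= 1]
    # Assumption: within a rank, higher S/N comes first, then higher power.
    return sorted(keep, key=lambda n: (ranks[n], -sn_values[n], -powers[n]))
```

With made-up values SN = [20, 15, 5, 12], P = [9, 8, 7, 3], THD2 = 10 and THD3 = 5, the channels (zero-based) receive the ranks 0, 1, 3 and 2, and only channels 0 and 1 remain as candidates, in that order.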

The noise determining section 6b collates combinations 
of all the ranks for N channels that are acquired by the 
optimal-speech determining section 6a with the noise
selection table 6g shown in FIG. 8, and determines any 
channel for which the ranking combination has a match as a 
noise channel. 

When the ranks of the individual channels starting at
the first channel CH1 are (Rnk0), (Rnk1), (Rnk2), (Rnk1), ...,
for example, the noise determining section 6b determines the
third channel CH3 as a noise channel. Then, the noise
determining section 6b sends noise candidate data DCT3 to
the controller 8.

When the optimal-speech determining section 6a
determines a candidate of speech frame data suitable for
speech recognition, the noise determining section 6b
determines a noise channel corresponding to the candidate of
speech frame data suitable for speech recognition by
referring to the individual "cases" in FIG. 8. Accordingly,
a candidate of speech frame data suitable for speech
recognition and noise data obtained by the microphone that
has picked up noise are determined in association with each
other.
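
The table lookup performed by the noise determining section 6b may be pictured as a dictionary lookup keyed by the combination of ranks. The sketch below is hypothetical: the actual contents of the noise selection table 6g in FIG. 8 are not reproduced here, so the table entries are placeholders, with only the first entry taken from the example given above (ranks 0, 1, 2, 1 yielding the third channel). The names `NOISE_SELECTION_TABLE` and `find_noise_channel` are assumptions.

```python
# Hypothetical stand-in for the noise selection table 6g: each "case" maps a
# combination of per-channel ranks to the zero-based index of the noise channel.
NOISE_SELECTION_TABLE = {
    (0, 1, 2, 1): 2,  # third channel, as in the example in the text
    (1, 0, 1, 2): 3,  # invented case, for illustration only
}


def find_noise_channel(ranks, table=NOISE_SELECTION_TABLE):
    """Return the noise channel for this rank combination, or None if no case matches."""
    return table.get(tuple(ranks))
```

A combination that matches no case returns `None`. How the embodiment handles an unmatched combination is not stated in this section.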

The individual cases 1, 2, 3 and so forth in the noise
selection table 6g in FIG. 8 are preset based on the results
of experiments on the voice characteristics obtained when
passengers actually uttered voices at various positions in a
vehicle in which all the microphones M1-MN were actually
installed.

When the speech candidate data DCT2 and the noise
candidate data DCT3 are supplied to the controller 8, the
controller 8 accesses the one of the memory areas ME1-MEN which
corresponds to the channel of the first candidate based on
the speech candidate data DCT2, reads the speech frame data
most suitable for speech recognition and supplies it to the
speech recognizer 7.

The speech recognizer 7 performs known processes, such
as SS (Spectrum Subtraction), echo canceling, noise
canceling and CMN, based on the speech frame data and noise
frame data supplied from the storage section 4 to thereby
eliminate a noise component from the speech frame data,
performs speech recognition based on the speech frame data
from which the noise component has been removed, and outputs
data Dout representing the result of speech recognition.

If an adequate speech recognition result is not
acquired from the speech recognition performed by the speech
recognizer 7 based on the speech frame data and the noise frame
data suitable for speech recognition, the controller 8
accesses the memory area that corresponds to the channel of
the next candidate suitable for speech recognition and
transfers the corresponding speech frame data to the speech
recognizer 7. Thereafter, the controller 8 supplies speech
frame data of the channels of subsequent candidates in order
to the speech recognizer 7 until an adequate speech
recognition result is acquired.
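
The retry behavior of the controller 8 can be sketched as a loop over the candidates in priority order. This is a minimal sketch under stated assumptions: `recognize` stands in for the speech recognizer 7 and `is_adequate` for whatever criterion decides that a result is adequate, neither of which is specified in this section, and the function name `recognize_with_fallback` is hypothetical.

```python
def recognize_with_fallback(candidates, recognize, is_adequate):
    """Try candidates in priority order until one gives an adequate result.

    candidates: iterable of (speech_frames, noise_frames) pairs, best first.
    recognize: callable standing in for the speech recognizer 7.
    is_adequate: callable deciding whether a result is acceptable.
    """
    for speech_frames, noise_frames in candidates:
        result = recognize(speech_frames, noise_frames)
        if is_adequate(result):
            return result
    return None  # no candidate produced an adequate result
```

Later candidates are read only when earlier ones fail, so the common case costs a single recognition pass.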

An example of the operation of this speech recognition
system which has the above-described structure will be
discussed with reference to the flowcharts shown in FIGS. 9
and 10. FIG. 9 illustrates an operational sequence from the
pickup of sounds with the microphones M1-MN to the storage of
the input data D1-DN into the storage section 4 as frame data,
and FIG. 10 illustrates the operation at the time the data
analyzer 6 determines optimal speech frame data and noise
frame data.



In FIG. 9, the speech recognition system stands by 
until the speech switch 9 is switched on in step 100. Upon 
occurrence of the ON event of the speech switch 9, the flow 
goes to step 102 to perform initialization. This 
initializing process clears a count value n of a channel-number
counter, a count value m of a frame-number counter
and all values F(1) to F(N) of the number-of-speeches
counters FC1-FCN, all provided in the controller 8.

The channel-number counter is provided to designate
each of the channels of the microphones M1-MN with the count
value n. The frame-number counter is provided to designate
the number (address) of each of the frame areas MF1, MF2, MF3
and so forth, provided in each of the memory areas ME1-MEN,
with the count value m.

N number-of-speeches counters FC1-FCN are provided in
association with the individual channels. That is, the
first number-of-speeches counter FC1 is provided in
association with the first channel, the second number-of-speeches
counter FC2 is provided in association with the
second channel, and so forth to the N-th number-of-speeches
counter FCN provided in association with the N-th channel.
The number-of-speeches counters FC1-FCN are used to determine
whether or not an LPC residual εn greater than the threshold
value THD1 has consecutively continued over three or more
frames and to determine the channel for which the LPC
residual εn has continued over three or more frames. The
number-of-speeches counters FC1-FCN are also used to
determine, as a speech-input channel, the channel for which
the LPC residual εn has continued over three or more frames.

In the next step 104, the first frame area MF1 of each
of the memory areas ME1-MEN is set. That is, the number, m,
of the frame area is set to m = 1.

In subsequent steps 106 and 108, the microphones M1-MN
start picking up sounds and the input data D1-DN acquired by
the voice pickup are stored in the individual first frame
areas MF1 of the memory areas ME1-MEN frame by frame.

When one frame of input data D1-DN is stored, the memory
area ME1 that corresponds to the first (n = 1) channel is
designated in step 110, and the LPC residual εn (n = 1) of
frame data stored in the first (m = 1) frame area MF1 of the
memory area ME1 is computed in step 112.

In the next step 114, the LPC residual εn is compared
with the threshold value THD1. When εn ≥ THD1, the flow
goes to step 116 to increment the value
F(1) of the number-of-speeches counter FC1 corresponding to
the first channel by "1". When εn < THD1, the flow goes to
step 118 to clear the value F(1) of the number-of-speeches
counter FC1.

When εn becomes equal to or greater than THD1 (εn ≥
THD1), therefore, the value F(1) of the number-of-speeches
counter FC1 becomes "1" which indicates that one frame of
speech has been input to the microphone M1 of the first
channel.

When εn becomes smaller than THD1 (εn < THD1), on the
other hand, the value F(1) of the number-of-speeches counter
FC1 is cleared to "0" which indicates that no speech has
been input to the microphone M1 of the first channel.

Next, it is checked if n is equal to N (n = N) in step
120 to determine whether the LPC residual εn in every
channel has been computed. When n = N is not met, the flow
goes to step 122 to make n = n + 1 to set the next channel,
and the sequence of processes from step 112 is repeated.
That is, by repeating the processes of steps 112 to 122, the
LPC residual εn of frame data stored in the frame area MF1
of each of the memory areas ME1-MEN is compared with the
threshold value THD1. When the LPC residual εn becomes
equal to or greater than the threshold value THD1, the value
F(n) of the number-of-speeches counter FCn corresponding to
that channel number n is incremented by "1".
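
The per-frame counter update of steps 112 to 122 can be sketched as below. The function name `update_counters` and the zero-based channel indexing are assumptions made for illustration; the LPC residual of each channel is taken as already computed.

```python
def update_counters(counters, residuals, thd1):
    """Update the per-channel number-of-speeches counters in place.

    counters: list holding F(1)..F(N) (zero-based here).
    residuals: LPC residual of the current frame for each channel.
    """
    for n, residual in enumerate(residuals):
        if residual >= thd1:
            counters[n] += 1  # step 116: one more consecutive speech frame
        else:
            counters[n] = 0   # step 118: the run of speech frames is broken
    return counters
```

Because a counter is cleared whenever the residual falls below THD1, each value F(n) counts consecutive frames rather than the total number of frames exceeding the threshold.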

When n = N is met in the aforementioned step 120, it is 
determined that the processing for all the channels has been 
completed, then the flow proceeds to step 124. 

In step 124, it is determined if any one of the values
F(1) to F(N) of the number-of-speeches counters FC1-FCN has
become equal to or greater than "3". If there is no such
count value, i.e., if every one of the values F(1) to F(N) is
equal to or smaller than "2", the flow goes to step 126.

In step 126, the individual second frame areas MF2 of
the memory areas ME1-MEN are set by setting m = m + 1.
Then, the processes of steps 106 to 124 are repeated.

Accordingly, the input data is stored in each frame
area MF2 (steps 106 and 108), the LPC residual εn of each
frame data stored in each frame area MF2 is compared with the
threshold value THD1 (steps 110 to 114), and each of the
values F(1) to F(N) of the number-of-speeches counters FC1-FCN
is incremented or cleared based on the comparison results.

In step 124, it is determined again if any one of the
values F(1) to F(N) of the number-of-speeches counters FC1-FCN
has become equal to or greater than "3". If there is no
such count value, the flow goes to step 126 to set m = m + 1
so that the next frame areas MF3 of the memory areas ME1-MEN
are set. Then, the processes of steps 106 to 124 are
repeated.

As the processes of steps 106 to 124 are repeated and
at least one of the values F(1) to F(N) of the number-of-speeches
counters FC1-FCN becomes equal to or greater than
"3", the flow proceeds to step 128.

In other words, in step 124, the values F(1) to F(N) of
the number-of-speeches counters FC1-FCN are checked and only
when the LPC residual εn greater than the threshold value
THD1 consecutively continues over three or more frames,
frame data stored in the memory area corresponding to that
channel is determined and settled as speech frame data.

In the next step 128, it is determined if the value of
the number-of-speeches counter for which it was determined
that the LPC residual εn greater than the threshold value THD1
consecutively continued over three or more frames has
reached "5". If that value has not reached "5" yet, the
process in step 126 is carried out, after which the processes
of steps 106 to 128 are repeated.

There may be a case where, when the value F(n) of the
number-of-speeches counter that corresponds to a given
channel n becomes "3", the values of the number-of-speeches
counters corresponding to the remaining channels are "1" or
"2". In this case, frame data stored in the memory areas
corresponding to the remaining channels are likely to be
also speech frame data.

To cope with this case, therefore, the processes of 
steps 106 to 128 are repeated twice to check if the frame 
data stored in the memory areas corresponding to the 
remaining channels are speech frame data. 
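
The stop condition of steps 124 and 128 can be sketched as a single loop over incoming frames. This is a simplified sketch under assumptions: the two decisions are merged into one test (some counter has reached 5), channels whose counter is 3 or more at that moment are taken as speech channels and the rest as noise channels, and the function name `detect_speech_channels` and its parameters `settle` and `stop` are hypothetical.

```python
def detect_speech_channels(residual_frames, thd1, settle=3, stop=5):
    """Scan frames until some channel shows `stop` consecutive speech frames.

    residual_frames: iterable of per-frame lists of LPC residuals, one value
    per channel. Returns (frames_used, speech_channels); the channel list is
    empty if the input ends before the stop condition is met.
    """
    counters = None
    frames_used = 0
    for residuals in residual_frames:
        if counters is None:
            counters = [0] * len(residuals)
        frames_used += 1
        for n, residual in enumerate(residuals):
            counters[n] = counters[n] + 1 if residual >= thd1 else 0
        if max(counters) >= stop:  # steps 124 and 128 combined (assumption)
            speech = [n for n, f in enumerate(counters) if f >= settle]
            return frames_used, speech
    return frames_used, []
```

In the usage below, channel 0 starts speaking in the first frame and channel 1 two frames later. When channel 0 reaches five consecutive frames, channel 1 has just reached three, so both are reported as speech channels, which is the situation the two additional frames are meant to catch.

```python
frames = [[1, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 0], [1, 1, 0], [1, 1, 0]]
print(detect_speech_channels(frames, 0.5))  # (5, [0, 1])
```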

When the decision in step 128 is "YES", the flow goes 
to step 130 where the speech detection data DCT1 which has
information on the memory area where speech frame data is 
stored and the memory area where noise frame data is stored 
is transferred to the controller 8. The flow then proceeds 
to a routine illustrated in FIG. 10. 

When the operation goes to the routine illustrated in
FIG. 10, the average voice power P(n), the average noise
power NP(n) and the average S/N value SN(n) for each channel
are computed first in step 200. Next, a candidate of speech
frame data suitable for speech recognition is determined
based on the speech condition table 6f shown in FIG. 7 in
step 202. In the next step 204, noise frame data suitable
for speech recognition is determined based on the noise
selection table 6g shown in FIG. 8.

In step 206, the speech candidate data DCT2 that 
indicates the candidate of speech frame data suitable for 
speech recognition and the noise candidate data DCT3 that 
indicates the noise frame data are sent to the controller 8 
from the data analyzer 6. In other words, the speech 
candidate data DCT2 and the noise candidate data DCT3 inform 
the controller 8 of the candidate of speech frame data 
suitable for speech recognition and noise frame data 
suitable for speech recognition associated with that 
candidate.

In the next step 208, the speech recognizer 7 reads
speech frame data and noise frame data most suitable for
speech recognition from the storage section 4, performs
speech recognition based on the read speech frame data and
noise frame data, and terminates a sequence of speech
recognition processes when an adequate speech recognition
result is acquired.

When no adequate speech recognition result is acquired,
on the other hand, the speech recognizer 7 checks in step
212 if there are next candidates of speech frame data and
noise frame data, reads the next candidates of speech frame
data and noise frame data, if present, from the storage
section 4 and repeats the sequence of processes starting at
step 208. When no adequate speech recognition result is
obtained even after re-execution of the speech recognition,
the speech recognizer 7 likewise reads next candidates of
speech frame data and noise frame data from the storage
section 4 and repeats the sequence of processes in steps 208
to 212 until an adequate speech recognition result is
obtained.

According to this embodiment, as apparent from the
above, a plurality of microphones M1-MN for inputting voices
are placed in a vehicle and speech frame data and noise
frame data suitable for speech recognition are automatically
extracted from those speech frame data and noise frame data
that are picked up by the microphones M1-MN and are subjected
to speech recognition. This speech recognition system can
therefore provide a plurality of speakers (passengers) with
better operability than the conventional speech
recognition system that is designed for a single speaker.

When one of a plurality of passengers directs a desired
speech to a certain microphone (e.g., M1), the uttered speech
may generally be picked up by the other microphones (M2-MN)
so that it is difficult to determine which microphone has
actually been intended to pick up the uttered speech.
According to this embodiment, however, speech frame data and
noise frame data suitable for speech recognition are
automatically extracted by using the speech condition table
6f and the noise selection table 6g, respectively shown in
FIGS. 7 and 8, and speech recognition is carried out based
on the extracted speech frame data and noise frame data.
This makes it possible to associate the passenger who has
made the speech with the microphone (e.g., M1) close to that
passenger with a very high probability.

Accordingly, this speech recognition system
automatically specifies a passenger who tries to perform a
voice-based manipulation of electronic equipment
installed in a vehicle and allows the optimal microphone
(close to the passenger) to pick up the uttered speech.
This can improve the speech recognition precision. With the
use of this speech recognition system, a passenger requires
no special manipulation but merely needs to utter words to
give his or her voiced instruction through the appropriate
microphone, so that this speech recognition system is
considerably easy to use.

Suppose that while one or more passengers who do not
intend to perform a voice-based manipulation of on-board
electronic equipment are making a conversation or the like,
one person utters words to perform such a voice-based
manipulation. Even in this case, the conversation or the
like made by the passengers who are not performing the
voice-based manipulation is determined as noise and
eliminated from consideration by automatically extracting
speech frame data and noise frame data suitable for speech
recognition by using the speech condition table 6f and the
noise selection table 6g, respectively shown in FIGS. 7 and
8, and then carrying out speech recognition based on the
extracted speech frame data and noise frame data. This can
provide a speech recognition system which is not affected by
a conversation or the like taking place in the vehicle
and which is very easy to use.

Although this embodiment is provided with the single
speech switch 9, shown in FIG. 1, which is switched on by,
for example, a driver, this invention is not limited to this
particular structure. For example, a plurality of
microphones M1-MN may be respectively provided with speech
switches TK1 to TKN as shown in the block diagram in FIG. 11,
so that when one of the speech switches is set on, the
controller 8 allows the microphone that corresponds to the
activated speech switch to pick up words and determines that
the remaining microphones corresponding to the inactive
speech switches have picked up noise in the vehicle.
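
With one speech switch per microphone, the split into a speech channel and noise channels follows directly from the index of the activated switch. The sketch below illustrates this under the assumption of zero-based indices; the function name `split_by_switch` is hypothetical.

```python
def split_by_switch(active_switch, num_channels):
    """Return (speech_channel, noise_channels) for one switch per microphone."""
    if not 0 <= active_switch < num_channels:
        raise ValueError("switch index out of range")
    noise = [n for n in range(num_channels) if n != active_switch]
    return active_switch, noise
```

For four microphones with the second switch set on, `split_by_switch(1, 4)` gives channel 1 as the speech channel and channels 0, 2 and 3 as noise channels, with no ranking step required.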

This modified structure can specify the microphone that 
has picked up an uttered speech and the microphones that 
have picked up noise before speech recognition. This can
shorten the processing time for determining the speech
data and noise data most suitable for speech recognition.

Further, the structure shown in FIG. 1 and the
structure shown in FIG. 11 may be combined as needed.
Specifically, speech switches smaller in number than the
microphones M1-MN may be placed at adequate locations in a
vehicle so that when one of the speech switches is set on,
the controller 8 detects the event and initiates speech
recognition. In this case, the speech switches do not
completely correspond one-to-one to the microphones M1-MN, so
that while speech recognition is carried out with the
structure shown in FIG. 1, the microphone that has picked up
an uttered speech and the microphones that have picked up
noise can be specified before speech recognition. This can
shorten the processing time for determining speech data and
noise data suitable for speech recognition.

In the case where the structure in FIG. 1 is adapted so
that speech switches smaller in number than the
microphones M1-MN are provided, each speech switch may be
assigned a layout range covering the associated microphone
or microphones, and the one or more microphones belonging to
each layout range may be specified in advance depending on
which speech switch has been set on. With this structure, the
data suitable for speech recognition have only to be
extracted from the pre-specified single or plural speech frame
data and noise frame data, thus making it possible to
shorten the processing time.
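
The layout-range idea can be sketched as a filter applied to the ranked candidates. The layout names, the switch-to-microphone mapping and the function name `restrict_candidates` are hypothetical and serve only to show how the search is narrowed.

```python
def restrict_candidates(candidates, switch, layout):
    """Keep only candidate channels inside the pressed switch's layout range.

    candidates: channel indices in priority order.
    layout: mapping from a switch name to the channels in its layout range.
    """
    allowed = set(layout[switch])
    return [n for n in candidates if n in allowed]
```

With a made-up layout `{"front": [0, 1], "rear": [2, 3]}` and candidates ranked `[2, 0, 3, 1]`, setting on the "rear" switch leaves `[2, 3]`, preserving the original priority order.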

Although the foregoing description of this embodiment 
and modifications has been given of a speech recognition 
system adapted to an on-board electronic equipment, the 
speech recognition system of this invention can also be 
adapted to other types of electronic apparatuses, such as a 
general-purpose microcomputer system and a so-called word 
processor, to enable voice-based entry of sentences or 
voice-based document editing.

According to this invention, in short, when a speaker
makes a desired speech, a speech signal and a noise signal
suitable for speech recognition are automatically determined
from the individual speech signals output from a plurality
of voice pickup sections (or voice pickup means) and speech
recognition is carried out based on the determined speech
signal and noise signal. Accordingly, the speaker has only
to utter words or voices without consciously making such a
speech to a specific voice pickup section. This leads to an
improved operability of the speech recognition system.

Although only one embodiment of the present invention 
and some modifications thereof have been described herein, 
it should be apparent to those skilled in the art that the 
present invention may be embodied in many other specific 
forms without departing from the spirit or scope of the 
invention. Therefore, the present examples and embodiment 
are to be considered as illustrative and not restrictive and 
the invention is not to be limited to the details given 
herein, but may be modified within the scope of the appended 
claims.



